Harmon is a unified framework for multimodal understanding and generation. It harmonizes the visual representations used by both tasks through a shared MAR (masked autoregressive) encoder, and it reports strong performance on text-to-image generation and multimodal understanding tasks.
Tags: Text-to-Image, Safetensors, English